Quantization has become a predominant approach for model compression, enabling deployment of large models trained on GPUs onto smaller form-factor devices for inference. Quantization-aware training (QAT) optimizes model parameters with respect to the end task while simulating quantization error, leading to better performance than post-training quantization. Approximation of gradients through the non-differentiable quantization operator is typically achieved using the straight-through estimator (STE) or additive noise. However, STE-based methods suffer from instability due to biased gradients, whereas existing noise-based methods cannot reduce the resulting variance. In this work, we incorporate exponentially decaying quantization-error-aware noise together with a learnable scale of task loss gradient to approximate the effect of a quantization operator. We show this method combines gradient scale and quantization noise in a better optimized way, providing finer-grained estimation of gradients at each weight and activation layer's quantizer bin size. Our controlled noise also contains an implicit curvature term that could encourage flatter minima, which we show is indeed the case in our experiments. Experiments training ResNet architectures on the CIFAR-10, CIFAR-100 and ImageNet benchmarks show that our method obtains state-of-the-art top-1 classification accuracy for uniform (non mixed-precision) quantization, out-performing previous methods by 0.5-1.2% absolute.
translated by 谷歌翻译
Recently, great progress has been made in single-image super-resolution (SISR) based on deep learning technology. However, the existing methods usually require a large computational cost. Meanwhile, the activation function will cause some features of the intermediate layer to be lost. Therefore, it is a challenge to make the model lightweight while reducing the impact of intermediate feature loss on the reconstruction quality. In this paper, we propose a Feature Interaction Weighted Hybrid Network (FIWHN) to alleviate the above problem. Specifically, FIWHN consists of a series of novel Wide-residual Distillation Interaction Blocks (WDIB) as the backbone, where every third WDIBs form a Feature shuffle Weighted Group (FSWG) by mutual information mixing and fusion. In addition, to mitigate the adverse effects of intermediate feature loss on the reconstruction results, we introduced a well-designed Wide Convolutional Residual Weighting (WCRW) and Wide Identical Residual Weighting (WIRW) units in WDIB, and effectively cross-fused features of different finenesses through a Wide-residual Distillation Connection (WRDC) framework and a Self-Calibrating Fusion (SCF) unit. Finally, to complement the global features lacking in the CNN model, we introduced the Transformer into our model and explored a new way of combining the CNN and Transformer. Extensive quantitative and qualitative experiments on low-level and high-level tasks show that our proposed FIWHN can achieve a good balance between performance and efficiency, and is more conducive to downstream tasks to solve problems in low-pixel scenarios.
translated by 谷歌翻译
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants and only 50% of the participants performed ensembling based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
translated by 谷歌翻译
Decentralized bilevel optimization has received increasing attention recently due to its foundational role in many emerging multi-agent learning paradigms (e.g., multi-agent meta-learning and multi-agent reinforcement learning) over peer-to-peer edge networks. However, to work with the limited computation and communication capabilities of edge networks, a major challenge in developing decentralized bilevel optimization techniques is to lower sample and communication complexities. This motivates us to develop a new decentralized bilevel optimization called DIAMOND (decentralized single-timescale stochastic approximation with momentum and gradient-tracking). The contributions of this paper are as follows: i) our DIAMOND algorithm adopts a single-loop structure rather than following the natural double-loop structure of bilevel optimization, which offers low computation and implementation complexity; ii) compared to existing approaches, the DIAMOND algorithm does not require any full gradient evaluations, which further reduces both sample and computational complexities; iii) through a careful integration of momentum information and gradient tracking techniques, we show that the DIAMOND algorithm enjoys $\mathcal{O}(\epsilon^{-3/2})$ in sample and communication complexities for achieving an $\epsilon$-stationary solution, both of which are independent of the dataset sizes and significantly outperform existing works. Extensive experiments also verify our theoretical findings.
translated by 谷歌翻译
The mainstream crowd counting methods regress density map and integrate it to obtain counting results. Since the density representation to one head accords to its adjacent distribution, it embeds the same category objects with variant values, while human beings counting models the invariant features namely similarity to objects. Inspired by this, we propose a rational and anthropoid crowd counting framework. To begin with, we leverage counting scalar as supervision signal, which provides global and implicit guidance to similar matters. Then, the large kernel CNN is utilized to imitate the paradigm of human beings which models invariant knowledge firstly and slides to compare similarity. Later, re-parameterization on pre-trained paralleled parameters is presented to cater to the inner-class variance on similarity comparison. Finally, the Random Scaling patches Yield (RSY) is proposed to facilitate similarity modeling on long distance dependencies. Extensive experiments on five challenging benchmarks in crowd counting show the proposed framework achieves state-of-the-art.
translated by 谷歌翻译
We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of-the-art models. We highlight commonalities between top approaches to the challenges and identify potential future directions for Embodied AI research.
translated by 谷歌翻译
深度学习已在许多神经影像应用中有效。但是,在许多情况下,捕获与小血管疾病有关的信息的成像序列的数量不足以支持数据驱动的技术。此外,基于队列的研究可能并不总是具有用于准确病变检测的最佳或必需成像序列。因此,有必要确定哪些成像序列对于准确检测至关重要。在这项研究中,我们旨在找到磁共振成像(MRI)序列的最佳组合,以深入基于学习的肿瘤周围空间(EPV)。为此,我们实施了一个有效的轻巧U-NET,适用于EPVS检测,并全面研究了来自易感加权成像(SWI),流体侵入的反转恢复(FLAIR),T1加权(T1W)和T2的不同信息组合 - 加权(T2W)MRI序列。我们得出的结论是,T2W MRI对于准确的EPV检测最为重要,并且在深神经网络中掺入SWI,FLAIR和T1W MRI可能会使精度的提高无关。
translated by 谷歌翻译
情感语音综合旨在使人类的声音具有各种情感影响。当前的研究主要集中于模仿属于特定情感类型的平均风格。在本文中,我们试图在运行时与情感混合在一起。我们提出了一种新颖的表述,可以衡量不同情绪的语音样本之间的相对差异。然后,我们将公式纳入序列到序列情感文本到语音框架中。在培训期间,该框架不仅明确地表征了情感风格,而且还通过用其他情感量化差异来探索情绪的序数。在运行时,我们通过手动定义情感属性向量来控制模型以产生所需的情绪混合物。客观和主观评估验证了拟议框架的有效性。据我们所知,这项研究是关于言语中混合情绪的建模,综合和评估混合情绪的第一项研究。
translated by 谷歌翻译
大规模的视觉预训练在各种下游任务中都表现出了令人印象深刻的进步。现有方法主要是通过图像和文本的全局表示形式的相似性或对图像和文本特征上的高级交叉模式关注来对跨模式对齐进行建模。但是,由于只有全局图像文本对齐信息,因此他们无法明确学习视觉区域和文本短语之间的细粒语义对齐。在本文中,我们介绍了Loupe,这是一种精细的语义一致性视觉语言预训练框架,该框架从新颖的游戏理论互动的角度学习了细粒度的语义对齐。为了有效地计算游戏理论相互作用,我们进一步提出了一种不确定性感知的神经Shapley交互学习模块。实验表明,Loupe在图像文本检索基准测试中实现了最新的。如果没有任何对象级的人类注释和微调,Loupe就可以在对象检测和视觉接地方面实现竞争性能。更重要的是,Loupe从大规模的原始图像文本对学习细粒语义的新方向。
translated by 谷歌翻译
了解人类情绪是智能机器人提供更好的人类机器人相互作用的关键能力。现有作品仅限于修剪视频级别的情感分类,无法找到与情感相对应的时间窗口。在本文中,我们介绍了一项新任务,称为视频中的时间情感本地化(TEL),该任务旨在检测人类的情感并将其相应的时间边界定位在带有校准字幕的未修剪视频中。与时间动作本地化相比,TEL提出了三个独特的挑战:1)情绪的时间动态极为多样; 2)情绪提示都嵌入了外观和复杂的情节中; 3)细粒度的时间注释是复杂且劳动密集型的。为了应对前两个挑战,我们提出了一个新颖的扩张上下文集成网络,该网络与粗细的两流体系结构。粗流通过建模多粒性时间上下文来捕获各种时间动力学。细流通过推理从粗流的多晶格时间上下文之间的依赖性来实现复杂的理解,并将它们自适应地集成到细粒度的视频段特征中。为了应对第三个挑战,我们引入了跨模式共识学习范式,该范式利用了对齐视频和字幕之间的固有语义共识,以实现弱监督的学习。我们为新的测试集提供了3,000个手动注释的时间边界,因此可以对TEL问题进行未来的研究进行定量评估。广泛的实验显示了我们方法对时间情绪定位的有效性。这项工作的存储库位于https://github.com/yyjmjc/temporal-emotion-localization-in-videos。
translated by 谷歌翻译